Objective: create a histogram that shows the binned annual compensations of respondents.
To arrive at the final histogram, this notebook bins the data and creates the bin labels and intervals.
Here we import the necessary modules:
import pandas as pd
from os import getcwd, path
import plotly.express as px
import plotly.offline as pyo
pyo.init_notebook_mode()
And load the CSV:
path_to_data = path.join(getcwd(), "data", "survey_results_public.csv")
data = pd.read_csv(path_to_data)
data
| | Respondent | MainBranch | Hobbyist | Age | Age1stCode | CompFreq | CompTotal | ConvertedComp | Country | CurrencyDesc | ... | SurveyEase | SurveyLength | Trans | UndergradMajor | WebframeDesireNextYear | WebframeWorkedWith | WelcomeChange | WorkWeekHrs | YearsCode | YearsCodePro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | I am a developer by profession | Yes | NaN | 13 | Monthly | NaN | NaN | Germany | European Euro | ... | Neither easy nor difficult | Appropriate in length | No | Computer science, computer engineering, or sof... | ASP.NET Core | ASP.NET;ASP.NET Core | Just as welcome now as I felt last year | 50.0 | 36 | 27 |
| 1 | 2 | I am a developer by profession | No | NaN | 19 | NaN | NaN | NaN | United Kingdom | Pound sterling | ... | NaN | NaN | NaN | Computer science, computer engineering, or sof... | NaN | NaN | Somewhat more welcome now than last year | NaN | 7 | 4 |
| 2 | 3 | I code primarily as a hobby | Yes | NaN | 15 | NaN | NaN | NaN | Russian Federation | NaN | ... | Neither easy nor difficult | Appropriate in length | NaN | NaN | NaN | NaN | Somewhat more welcome now than last year | NaN | 4 | NaN |
| 3 | 4 | I am a developer by profession | Yes | 25.0 | 18 | NaN | NaN | NaN | Albania | Albanian lek | ... | NaN | NaN | No | Computer science, computer engineering, or sof... | NaN | NaN | Somewhat less welcome now than last year | 40.0 | 7 | 4 |
| 4 | 5 | I used to be a developer by profession, but no... | Yes | 31.0 | 16 | NaN | NaN | NaN | United States | NaN | ... | Easy | Too short | No | Computer science, computer engineering, or sof... | Django;Ruby on Rails | Ruby on Rails | Just as welcome now as I felt last year | NaN | 15 | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 64456 | 64858 | NaN | Yes | NaN | 16 | NaN | NaN | NaN | United States | NaN | ... | NaN | NaN | NaN | Computer science, computer engineering, or sof... | NaN | NaN | NaN | NaN | 10 | Less than 1 year |
| 64457 | 64867 | NaN | Yes | NaN | NaN | NaN | NaN | NaN | Morocco | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 64458 | 64898 | NaN | Yes | NaN | NaN | NaN | NaN | NaN | Viet Nam | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 64459 | 64925 | NaN | Yes | NaN | NaN | NaN | NaN | NaN | Poland | NaN | ... | NaN | NaN | NaN | NaN | Angular;Angular.js;React.js | NaN | NaN | NaN | NaN | NaN |
| 64460 | 65112 | NaN | Yes | NaN | NaN | NaN | NaN | NaN | Spain | NaN | ... | NaN | NaN | NaN | Computer science, computer engineering, or sof... | ASP.NET Core;jQuery | Angular;Angular.js;ASP.NET Core;jQuery | NaN | NaN | NaN | NaN |
64461 rows × 61 columns
We can keep only the column needed for this notebook:
data = data[["ConvertedComp"]]
ConvertedComp holds the annual compensation in USD, which is the only column this histogram needs.
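In order to create the final histogram, we first need to filter the compensations, create the bin labels and intervals, and assign each compensation to its bin.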
Since the "Age of Respondents" notebook already walks through the exploratory data analysis, this notebook only applies the filter and skips the exploration of where to set the limits.
# Python ignores underscores in numeric literals; they only make
# two hundred thousand easier for us to read
data = data[(data["ConvertedComp"] >= 0) & (data["ConvertedComp"] <= 200_000)]
print(f"Rows left: {data.shape[0]:,}")
Rows left: 32,450
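As a small sketch of the filter above, here is the same mask on made-up numbers (the values in toy are hypothetical, not taken from the survey). It also shows that rows with a missing compensation drop out, because comparisons against NaN evaluate to False, and that Series.between expresses the same inclusive range check.

```python
import pandas as pd

# Hypothetical compensations, not taken from the survey
toy = pd.DataFrame(
    {"ConvertedComp": [-5.0, 0.0, 61_000.0, 200_000.0, 250_000.0, None]}
)

# Same mask as in the notebook: both ends of the range are inclusive
mask = (toy["ConvertedComp"] >= 0) & (toy["ConvertedComp"] <= 200_000)

# Series.between is inclusive on both ends by default
same_mask = toy["ConvertedComp"].between(0, 200_000)

filtered = toy[mask]
print(filtered["ConvertedComp"].tolist())
print(f"Rows left: {filtered.shape[0]:,}")
```

Only the values 0, 61,000 and 200,000 survive; the negative value, the value above the limit, and the missing value are all removed.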
For creating the bins, we'll make use of list comprehensions for compact code and pandas' IntervalIndex.
Let's start by creating the bin labels. In this case they will be almost the same as the bins themselves, except the labels are written in thousands of USD for readability.
# Create bins of 15,000 USD, written in the thousands
# Hence dividing by 1,000 and the K at the end
# These bins will be closed on the left and open on the right
bin_labels = [
    f"[{int(i / 1_000):,}K, {int((i + 15_000) / 1_000):,}K)"
    for i in range(0, 200_001, 15_000)
]
bin_labels
['[0K, 15K)', '[15K, 30K)', '[30K, 45K)', '[45K, 60K)', '[60K, 75K)', '[75K, 90K)', '[90K, 105K)', '[105K, 120K)', '[120K, 135K)', '[135K, 150K)', '[150K, 165K)', '[165K, 180K)', '[180K, 195K)', '[195K, 210K)']
The bin ranges are very similar to the labels, but this second list comprehension creates tuples of integers instead of custom strings with the open and closed notation. As mentioned before, the actual bin ranges are not converted to thousands; otherwise we would need an extra transformation of the data.
compensation_bins = pd.IntervalIndex.from_tuples(
    [
        (i, i + 15_000)
        for i in range(0, 200_001, 15_000)
    ],
    closed="left"
)
For more information on IntervalIndex, and from_tuples in particular, please refer to its documentation or, alternatively, to a Medium article I wrote that shows it in action. In short, it creates proper bin ranges from tuples, here closed on the left.
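To make "closed on the left" concrete, here is a minimal sketch with three hypothetical bins (toy_bins is an illustrative name, not part of the notebook). The left edge belongs to a bin, while the right edge belongs to the next one.

```python
import pandas as pd

# Three hypothetical bins of 15,000 USD, closed on the left
toy_bins = pd.IntervalIndex.from_tuples(
    [(0, 15_000), (15_000, 30_000), (30_000, 45_000)],
    closed="left",
)

first_bin = toy_bins[0]
second_bin = toy_bins[1]

print(0 in first_bin)        # the left edge is included
print(15_000 in first_bin)   # the right edge is excluded
print(15_000 in second_bin)  # and belongs to the next bin instead
```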
Now we can take that IntervalIndex and use it to put the compensations in their respective bins.
data = pd.cut(
    data["ConvertedComp"],
    compensation_bins,
    precision=0, # Precision for storing and displaying the bin labels
    include_lowest=True # The first interval/bin should be left-inclusive
)
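The sketch below shows how pd.cut assigns values to left-closed bins, using two hypothetical bins and a handful of made-up values. A value that sits exactly on a shared edge goes to the bin on its right, and a value outside every bin becomes NaN. That is why the notebook's last bin runs up to 210,000: a compensation of exactly 200,000 still needs an interval to land in.

```python
import pandas as pd

# Two hypothetical bins, closed on the left
toy_bins = pd.IntervalIndex.from_tuples(
    [(0, 15_000), (15_000, 30_000)], closed="left"
)

# Made-up values, including both edges and one value outside the bins
toy_comp = pd.Series([0, 14_999, 15_000, 29_999, 30_000])

binned = pd.cut(toy_comp, toy_bins)

# Category codes: the position of the bin each value fell into, -1 for no bin
codes = binned.cat.codes.tolist()
print(codes)
print(binned.isna().tolist())
```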
We also need to store the bins as strings, since Plotly does not support the interval values, or rather the category dtype, that pd.cut returns.
# Sort while the values are still ordered intervals,
# so the bins keep the order of the original tuples
data = data.sort_values().astype("str")
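The order of these two steps matters. As a minimal sketch with three hypothetical bins: sorting the categorical values follows the bin order, whereas sorting after the conversion compares the strings character by character, so a bin starting at 105,000 would come before one starting at 15,000.

```python
import pandas as pd

# Hypothetical bins chosen so that numeric and text order disagree
toy_bins = pd.IntervalIndex.from_tuples(
    [(0, 15_000), (15_000, 30_000), (105_000, 120_000)], closed="left"
)
binned = pd.cut(pd.Series([110_000, 20_000, 5_000]), toy_bins)

# Sort first, convert second: follows the order of the bins
sorted_then_str = binned.sort_values().astype("str").tolist()

# Convert first, sort second: plain text order
str_then_sorted = binned.astype("str").sort_values().tolist()

print(sorted_then_str)
print(str_then_sorted)
```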
fig = px.histogram(
    data,
    title="Annual Compensation (USD)",
)
fig.update_layout(
    xaxis = {
        "tickmode": "array",
        # Assumes every bin holds at least one value, so the sorted unique
        # values line up one-to-one with bin_labels
        "tickvals": data.unique(),
        "ticktext": bin_labels
    },
    xaxis_title = "Annual Compensation",
    yaxis_title = "Frequency",
    title_x = 0.5,
    bargap = 0,
    showlegend = False
)
fig.update_traces(
    marker = {
        "line": {
            "width": 2,
            "color": "DarkSlateGrey"
        }
    }
)
fig.show()
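Pairing data.unique() with bin_labels works as long as every bin contains at least one value. As a hedged alternative, not the approach used above, the tick values could be derived from the bins themselves, so that values and labels always have the same length. The sketch below uses three hypothetical bins with an empty one in the middle; toy_bins, toy_labels and label_for are illustrative names.

```python
import pandas as pd

starts = range(0, 45_000, 15_000)
toy_bins = pd.IntervalIndex.from_tuples(
    [(i, i + 15_000) for i in starts], closed="left"
)
toy_labels = [f"[{i // 1_000}K, {(i + 15_000) // 1_000}K)" for i in starts]

# Made-up values: nothing falls into the middle bin
binned = (
    pd.cut(pd.Series([1_000, 2_000, 40_000]), toy_bins)
    .sort_values()
    .astype("str")
)

# Tick values taken from the data: only two entries for three labels
from_data = list(binned.unique())

# Tick values taken from the bins: one entry per label, whatever the data
from_bins = [str(interval) for interval in toy_bins]
label_for = dict(zip(from_bins, toy_labels))

print(len(from_data), len(from_bins), len(toy_labels))
```

With this variant, from_bins would be passed as tickvals and toy_labels as ticktext.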